By Andreas Troxler, June 2022
In this Part II of the tutorial, you will learn techniques that can be applied in situations with few or no labels. This is very relevant in practice: text data is often available, but labels are missing or sparse!
Let’s get started.
This tutorial is divided into four parts; they are:
Introduction.
We begin by explaining pre-requisites. Then we turn to loading and exploring the dataset – ca. 6k records of short property insurance claim descriptions, which we aim to classify by peril type.
Classify by peril type in a supervised setting.
To warm up, we apply supervised learning techniques you have learned in Part I to the dataset of this Part II.
Zero-shot classification.
This technique assigns each text sample to one element of a pre-defined list of candidate expressions. It allows classification without any task-specific training and without using labels, which makes this fully unsupervised approach useful in situations where no labels are available.
Unsupervised topic modeling by clustering of document embeddings.
This approach extracts clusters of similar text samples and proposes verbal representations of these clusters. The labels are not required, but may be used in the process if available. This technique does not require prior knowledge of candidate expressions.
This notebook is computationally intensive. We recommend using a platform with GPU support.
We have run this notebook on Google Colab and on an Amazon EC2 p2.xlarge instance (an older generation of GPU-based instances).
Please note that the results may not be reproducible across platforms and versions.
Make sure the following files are available in the directory of the notebook:
tutorial_utils.py - a collection of utility functions used throughout this notebook
peril.training.csv - the training data
peril.validation.csv - the validation data
This notebook will create the following subdirectories:
models - trained Transformer models
results - figures and Excel files
For this tutorial, we assume that you are already familiar with Python and Jupyter Notebook. We also assume that you have worked through Part I of this tutorial.
In this section, Jupyter Notebook and Python settings are initialized. For code in Python, the PEP8 standard ("PEP = Python Enhancement Proposal") is enforced with minor variations to improve readability.
# Notebook settings
# clear the namespace variables
from IPython import get_ipython
get_ipython().run_line_magic("reset", "-sf")
# formatting: cell width
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
If you run this notebook on Google Colab, you will need to install the following libraries:
!pip install datasets
!pip install transformers
!pip install plotly
!pip install kaleido
!pip install pyyaml==5.4.1 ## https://github.com/yaml/pyyaml/issues/576
!pip install bertopic
Then we import the required libraries:
import os
from collections import OrderedDict
import pandas as pd
import numpy as np
from scipy.special import softmax
from datasets import Dataset, DatasetDict
from transformers import AutoTokenizer, AutoModel, Trainer, TrainingArguments, trainer_utils, AutoModelForSequenceClassification
from transformers import pipeline
import torch
from sklearn.metrics import accuracy_score, f1_score
import plotly.express as px
from wordcloud import WordCloud
from bertopic import BERTopic
from umap import UMAP
from hdbscan import HDBSCAN
from tutorial_utils import extract_sequence_encoding, get_xy, dummy_classifier, logistic_regression_classifier, evaluate_classifier
The dataset used throughout this tutorial concerns property insurance claims of the Wisconsin Local Government Property Insurance Fund (LGPIF), made available in the open text project of Frees. The Wisconsin LGPIF is an insurance pool managed by the Wisconsin Office of the Insurance Commissioner. This fund provides insurance protection to local governmental institutions such as counties, schools, libraries, airports, etc. It insures property claims for buildings and motor vehicles, and it excludes certain natural and man-made perils like flood, earthquakes or nuclear accidents.
The data consists of 6’030 records (4’991 in the training set, 1’039 in the test set) which include a claim amount, a short English claim description and a hazard type with 9 different levels: Fire, Lightning, Hail, Wind, WaterW (weather related water claims), WaterNW (other weather claims), Vehicle, Vandalism and Misc (any other).
The training and validation sets are available in separate csv files, which we load into Pandas DataFrames.
We then create a single column containing the label, and finally create a dataset.
# load data
df_train = pd.read_csv("peril.training.csv")
df_valid = pd.read_csv("peril.validation.csv")
# extract label texts and create column "labels" which encodes the peril
labels = df_train.columns[:9].to_list()
df_train["labels"] = np.matmul(df_train.iloc[:, :9].values, np.arange(9).reshape((9, 1)))
df_valid["labels"] = np.matmul(df_valid.iloc[:, :9].values, np.arange(9).reshape((9, 1)))
# create dataset
ds = DatasetDict({"train": Dataset.from_pandas(df_train), "test": Dataset.from_pandas(df_valid)})
print(f"{ds}")
DatasetDict({
train: Dataset({
features: ['Vandalism', 'Fire', 'Lightning', 'Wind', 'Hail', 'Vehicle', 'WaterNW', 'WaterW', 'Misc', 'Loss', 'Description', 'labels'],
num_rows: 4991
})
test: Dataset({
features: ['Vandalism', 'Fire', 'Lightning', 'Wind', 'Hail', 'Vehicle', 'WaterNW', 'WaterW', 'Misc', 'Loss', 'Description', 'labels'],
num_rows: 1039
})
})
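The matrix multiplication above is just a compact way of converting the one-hot peril indicator columns into a single integer label; on strictly one-hot rows it is equivalent to argmax. A small sketch with hypothetical rows:

```python
import numpy as np

# Hypothetical one-hot peril indicators for three claims (9 peril columns).
one_hot = np.array([
    [0, 0, 1, 0, 0, 0, 0, 0, 0],   # Lightning -> 2
    [1, 0, 0, 0, 0, 0, 0, 0, 0],   # Vandalism -> 0
    [0, 0, 0, 0, 0, 1, 0, 0, 0],   # Vehicle   -> 5
])

# Multiplying by the column vector [0, 1, ..., 8] picks out the index
# of the single 1 in each row ...
via_matmul = one_hot @ np.arange(9).reshape(9, 1)

# ... which, for strictly one-hot rows, is the same as argmax.
via_argmax = one_hot.argmax(axis=1)

print(via_matmul.ravel())  # [2 0 5]
print(via_argmax)          # [2 0 5]
```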
df_train.head()
| Vandalism | Fire | Lightning | Wind | Hail | Vehicle | WaterNW | WaterW | Misc | Loss | Description | labels | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 6838.87 | lightning damage ... | 2 |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 2085.00 | lightning damage at Comm. Center ... | 2 |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 11335.00 | lightning damage at water tower ... | 2 |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1480.00 | lightning damge to radio tower ... | 2 |
| 4 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 600.00 | vandalism damage at recycle center ... | 0 |
Let's look at the distribution of peril types in the training and validation set:
stats = pd.DataFrame({
"peril": df_train.columns.values[:-3],
"train": df_train.groupby("labels")["labels"].count().values,
"valid": df_valid.groupby("labels")["labels"].count().values
})
summary = pd.DataFrame({"peril": ["Total"], "train": [stats["train"].sum()], "valid": [stats["valid"].sum()]})
stats = pd.concat([stats, summary], ignore_index=True)
stats
| peril | train | valid | |
|---|---|---|---|
| 0 | Vandalism | 1774 | 310 |
| 1 | Fire | 171 | 46 |
| 2 | Lightning | 832 | 123 |
| 3 | Wind | 296 | 107 |
| 4 | Hail | 76 | 18 |
| 5 | Vehicle | 852 | 227 |
| 6 | WaterNW | 202 | 67 |
| 7 | WaterW | 426 | 38 |
| 8 | Misc | 362 | 103 |
| 9 | Total | 4991 | 1039 |
fig = px.bar(df_train["labels"].value_counts().sort_index()+df_valid["labels"].value_counts().sort_index(), width=640)
fig.update_layout(title="number of claims by peril type", xaxis_title="peril type",
yaxis_title="number of claims")
fig.show(config={"toImageButtonOptions": {"format": 'svg', "filename": "peril_type"}})
Next, we want to see some statistics on the length of the claims descriptions. To this end, we split the texts into words, using blank spaces as separators. The text length averages 5 words and does not seem to vary significantly by peril:
# statistics of description length
df_train["words per description"] = df_train["Description"].str.split().apply(len)
print(f"Overall number of words by claim description: min {df_train['words per description'].min()}, "
f"average {df_train['words per description'].mean():.0f}, max {df_train['words per description'].max()}")
fig = px.box(df_train, x="labels", y="words per description", width=640)
fig.show(config={"toImageButtonOptions": {"format": "svg", "filename": "peril_len"}})
Overall number of words by claim description: min 1, average 5, max 11
To get an impression of the most frequent words, we generate a simple word cloud from all claim descriptions. By default, the word cloud excludes so-called stop words (such as articles, prepositions, pronouns, conjunctions, etc.), which are the most common words and do not add much information to the text.
text = df_train["Description"].str.cat(sep=" ")
# Create and generate a word cloud image:
word_cloud = WordCloud(scale=5, background_color="white").generate(text)
# Display the generated image:
fig = px.imshow(word_cloud, width=1440)
fig.update_layout(xaxis_showticklabels=False, yaxis_showticklabels=False)
fig.show(config={"toImageButtonOptions": {"format": "svg", "filename": "peril_cloud"}})
In this section, we will train classifiers to predict the peril type (labels).
We will follow two approaches:
We use a transformer encoder to encode the claim descriptions, and then train a logistic regression classifier to predict the peril type from the encoded descriptions.
We train a transformer encoder with a classifier head directly.
Let's get started.
We follow the approach presented in Part I of this tutorial.
In this single-language case study, we use the distilbert-base-uncased model.
First, we load the model and the tokenizer.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = AutoModel.from_pretrained(model_name).to(device)
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertModel: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight'] - This IS expected if you are initializing DistilBertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DistilBertModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Then we define a function that applies the tokenizer to the column Description of an input batch...
# define a function to tokenize a batch
def tokenize(batch):
return tokenizer(batch["Description"], truncation=True, padding=True, max_length=12)
... and we apply this function to the entire dataset by use of the map function:
ds = ds.map(tokenize, batched=True)
Next, we apply the function extract_sequence_encoding (implemented in tutorial_utils.py),
which runs the model on a batch and extracts the last hidden state, i.e. the encoded input text.
ds = ds.map(extract_sequence_encoding, fn_kwargs={"model": model}, batched=True, batch_size=16)
We fit a dummy classifier (which always predicts the most frequent class) to the mean-pooled encodings.
We also fit a logistic regression classifier and evaluate its performance on the training and validation split.
For ease of use, this functionality is implemented in tutorial_utils.py.
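For intuition, mean pooling averages the token embeddings of a sequence while ignoring padding positions. A minimal sketch of this idea (the actual implementation in tutorial_utils.py may differ in details):

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden) token embeddings
    attention_mask:    (batch, seq_len), 1 for real tokens, 0 for padding
    """
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(axis=1)   # sum over real tokens only
    counts = mask.sum(axis=1)                         # number of real tokens
    return summed / counts                            # (batch, hidden)

# Toy batch: one sequence of 3 tokens (the last one is padding), hidden size 2
h = np.array([[[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]]])
m = np.array([[1, 1, 0]])
print(mean_pool(h, m))  # [[2. 3.]] -- the padding token is ignored
```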
x_train, y_train, x_test, y_test = get_xy(ds, "mean_hidden_state", "labels")
# fit dummy classifier
clf_dummy = dummy_classifier(x_train, y_train)
_ = evaluate_classifier(y_test, None, clf_dummy.predict_proba(x_test), labels, "Dummy classifier", None)
Dummy classifier
accuracy score = 29.8%, log loss = 1.977, Brier loss = 0.835
classification report
precision recall f1-score support
Vandalism 0.30 1.00 0.46 310
Fire 0.00 0.00 0.00 46
Lightning 0.00 0.00 0.00 123
Wind 0.00 0.00 0.00 107
Hail 0.00 0.00 0.00 18
Vehicle 0.00 0.00 0.00 227
WaterNW 0.00 0.00 0.00 67
WaterW 0.00 0.00 0.00 38
Misc 0.00 0.00 0.00 103
accuracy 0.30 1039
macro avg 0.03 0.11 0.05 1039
weighted avg 0.09 0.30 0.14 1039
# fit a logistic regression classifier to the encoded texts
clf = logistic_regression_classifier(x_train, y_train, c=0.2)
_ = evaluate_classifier(y_test, None, clf.predict_proba(x_test), labels, "Logistic Regression classifier", "cm_peril_lr")
Logistic Regression classifier
accuracy score = 83.9%, log loss = 0.531, Brier loss = 0.243
classification report
precision recall f1-score support
Vandalism 0.85 0.95 0.90 310
Fire 0.94 0.70 0.80 46
Lightning 0.90 0.93 0.91 123
Wind 0.91 0.87 0.89 107
Hail 0.93 0.78 0.85 18
Vehicle 0.90 0.92 0.91 227
WaterNW 0.72 0.34 0.46 67
WaterW 0.45 0.76 0.57 38
Misc 0.75 0.60 0.67 103
accuracy 0.84 1039
macro avg 0.82 0.76 0.77 1039
weighted avg 0.84 0.84 0.83 1039
This result is encouraging.
From the classification report, we see that the perils WaterNW, WaterW and Misc are most difficult to predict.
In this section, we directly train a transformer-based sequence classifier, using the approach described in Part I of this tutorial.
On an AWS EC2 p2.xlarge instance, the run time is about 2 minutes.
model_name = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducibility, set random seed before instantiating the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels)).to(device)
def compute_metrics(pred):
labels = pred.label_ids
preds = pred.predictions.argmax(-1)
f1 = f1_score(labels, preds, average="weighted")
acc = accuracy_score(labels, preds)
return {"accuracy": acc, "f1": f1}
# train the model
batch_size = 8
logging_steps = len(ds["train"]) // batch_size
training_args = TrainingArguments(
output_dir=model_name+"_peril_epochs",
num_train_epochs=2,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
metric_for_best_model="f1",
logging_steps=logging_steps,
save_strategy=trainer_utils.IntervalStrategy.NO,
)
trainer = Trainer(model=model, args=training_args,
compute_metrics=compute_metrics, train_dataset=ds["train"],
eval_dataset=ds["test"])
trainer.train();
trainer.save_model(model_name + "_peril")
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight'] - This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model). - This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model). Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias'] You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference. The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: cls_hidden_state, WaterW, Hail, Misc, Vandalism, Fire, Lightning, Description, mean_hidden_state, Wind, Loss, Vehicle, WaterNW. If cls_hidden_state, WaterW, Hail, Misc, Vandalism, Fire, Lightning, Description, mean_hidden_state, Wind, Loss, Vehicle, WaterNW are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message. ***** Running training ***** Num examples = 4991 Num Epochs = 2 Instantaneous batch size per device = 8 Total train batch size (w. parallel, distributed & accumulation) = 8 Gradient Accumulation steps = 1 Total optimization steps = 1248
| Step | Training Loss |
|---|---|
| 623 | 0.530700 |
| 1246 | 0.304900 |
Training completed. Do not forget to share your model on huggingface.co/models =) Saving model checkpoint to distilbert-base-uncased_peril Configuration saved in distilbert-base-uncased_peril/config.json Model weights saved in distilbert-base-uncased_peril/pytorch_model.bin
We evaluate the model on the test set:
predictions = trainer.predict(ds["test"])
_ = evaluate_classifier(predictions.label_ids, None, softmax(predictions.predictions, axis=1), labels,"Transformer-based classifier", "cm_peril_transformer")
The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: cls_hidden_state, WaterW, Hail, Misc, Vandalism, Fire, Lightning, Description, mean_hidden_state, Wind, Loss, Vehicle, WaterNW. If cls_hidden_state, WaterW, Hail, Misc, Vandalism, Fire, Lightning, Description, mean_hidden_state, Wind, Loss, Vehicle, WaterNW are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message. ***** Running Prediction ***** Num examples = 1039 Batch size = 8
Transformer-based classifier
accuracy score = 85.2%, log loss = 0.520, Brier loss = 0.219
classification report
precision recall f1-score support
Vandalism 0.90 0.94 0.92 310
Fire 0.86 0.83 0.84 46
Lightning 0.94 0.94 0.94 123
Wind 0.96 0.91 0.93 107
Hail 1.00 0.94 0.97 18
Vehicle 0.92 0.93 0.93 227
WaterNW 0.81 0.19 0.31 67
WaterW 0.39 0.87 0.54 38
Misc 0.70 0.66 0.68 103
accuracy 0.85 1039
macro avg 0.83 0.80 0.78 1039
weighted avg 0.87 0.85 0.84 1039
The performance is comparable to that of the logistic regression classifier, with an improved Brier loss and accuracy score.
It appears that the model struggles to tell WaterNW apart from WaterW.
In practice, there are situations with few or no labeled data.
Zero-shot classification is well suited to this case: it classifies text sequences in an unsupervised way, without training data and without building a task-specific model.
The model is presented with a text sequence and a list of expressions, and assigns a probability to each expression.
In this section you will learn how to apply zero-shot classification to perform the classification by peril type on the claims data described above.
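Under the hood, the default zero-shot pipeline is a natural language inference (NLI) model: each candidate expression is turned into a hypothesis such as "This example is {label}.", the model scores whether the text entails it, and (in single-label mode) the entailment logits are softmaxed across the candidates. A rough sketch of this last step with hypothetical logits (not the pipeline's actual code):

```python
import numpy as np

def zero_shot_scores(nli_logits):
    """Turn per-label (contradiction, entailment) logits into label probabilities.

    In single-label mode, the entailment logit of each candidate label is
    softmaxed across all candidates; contradiction scores are discarded.
    """
    entail = nli_logits[:, 1]            # entailment logit per candidate label
    exp = np.exp(entail - entail.max())  # numerically stable softmax
    return exp / exp.sum()

# Hypothetical NLI logits for 3 candidate labels: (contradiction, entailment)
logits = np.array([
    [ 2.0, -1.0],   # label 0: likely contradicted
    [-1.5,  3.0],   # label 1: strongly entailed
    [ 0.0,  0.5],   # label 2: weakly entailed
])
probs = zero_shot_scores(logits)
print(probs.argmax())  # label 1 gets the highest probability
```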
First, we create a dictionary mapping verbal expressions to peril types:
choices = OrderedDict({
"Vandalism": 0,
"Theft": 0,
"Fire": 1,
"Lightning": 2,
"Wind": 3,
"Hail": 4,
"Vehicle": 5,
"Water": 6,
"Weather": 7,
"Misc": 8})
We set up the zero-shot classifier using the pipeline abstraction.
By default, the facebook/bart-large-mnli model is used.
By specifying device=0, we run the pipeline on the first GPU.
classifier = pipeline("zero-shot-classification", device=0)
No model was supplied, defaulted to facebook/bart-large-mnli (https://huggingface.co/facebook/bart-large-mnli)
Model config BartConfig {
"_name_or_path": "facebook/bart-large-mnli",
"_num_labels": 3,
"activation_dropout": 0.0,
"activation_function": "gelu",
"add_final_layer_norm": false,
"architectures": [
"BartForSequenceClassification"
],
"attention_dropout": 0.0,
"bos_token_id": 0,
"classif_dropout": 0.0,
"classifier_dropout": 0.0,
"d_model": 1024,
"decoder_attention_heads": 16,
"decoder_ffn_dim": 4096,
"decoder_layerdrop": 0.0,
"decoder_layers": 12,
"decoder_start_token_id": 2,
"dropout": 0.1,
"encoder_attention_heads": 16,
"encoder_ffn_dim": 4096,
"encoder_layerdrop": 0.0,
"encoder_layers": 12,
"eos_token_id": 2,
"forced_eos_token_id": 2,
"gradient_checkpointing": false,
"id2label": {
"0": "contradiction",
"1": "neutral",
"2": "entailment"
},
"init_std": 0.02,
"is_encoder_decoder": true,
"label2id": {
"contradiction": 0,
"entailment": 2,
"neutral": 1
},
"max_position_embeddings": 1024,
"model_type": "bart",
"normalize_before": false,
"num_hidden_layers": 12,
"output_past": false,
"pad_token_id": 1,
"scale_embedding": false,
"transformers_version": "4.19.2",
"use_cache": true,
"vocab_size": 50265
}
Then, we feed the claims descriptions of the entire test set, presenting the classifier with the list of possible choices as the second argument.
We use the test set directly, because zero-shot classification requires no training!
On an AWS EC2 p2.xlarge instance, the run time is about 5 minutes.
res = classifier(ds["test"]["Description"], list(choices.keys()))
Disabling tokenizer parallelism, we're using DataLoader multithreading already
This returns a list of dicts with the following keys:
sequence (str) — The sequence for which this is the output.
labels (List[str]) — The labels sorted by order of likelihood.
scores (List[float]) — The probabilities for each of the labels.
We store the predictions in a Pandas DataFrame and evaluate the performance.
proba = np.zeros((df_valid.shape[0], len(labels)))
for i, sample in enumerate(res):
for label, score in zip(sample["labels"], sample["scores"]):
proba[i, choices[label]] += score
proba[i, :] = proba[i, :] / np.sum(proba[i, :])
_ = evaluate_classifier(np.array(df_valid["labels"]), None, proba, labels, "Zero-shot-classification", "cm_peril_zero_a")
Zero-shot-classification
accuracy score = 65.5%, log loss = 1.043, Brier loss = 0.463
classification report
precision recall f1-score support
Vandalism 0.93 0.44 0.59 310
Fire 0.68 0.70 0.69 46
Lightning 0.94 0.93 0.94 123
Wind 1.00 0.84 0.91 107
Hail 0.75 1.00 0.86 18
Vehicle 0.89 0.69 0.77 227
WaterNW 0.56 0.75 0.64 67
WaterW 0.00 0.00 0.00 38
Misc 0.25 0.83 0.38 103
accuracy 0.66 1039
macro avg 0.67 0.69 0.64 1039
weighted avg 0.79 0.66 0.68 1039
On the test set, we achieve an accuracy of 65.5% (compared to 29.8% for the dummy classifier).
Apparently, the classifier struggles to correctly identify the WaterW cases based on the expression “Weather”.
Also, it seems that the expression “Misc” may not be the optimal choice, as it produces many false positives.
pred = [{
**{"pred"+str(i): choices[item["labels"][i]] for i in range(10)},
**{"score"+str(i): item["scores"][i] for i in range(10)}
} for item in res]
df_pred = pd.DataFrame(pred)
df_pred[["labels", "Description"]] = df_valid[["labels", "Description"]]
To improve the performance on "Misc", we introduce the following heuristic: if the expression “Misc” receives the highest probability, but its margin over the second-most likely expression is less than 50 percentage points, we select the latter.
def select_misc(row, threshold):
return row["pred1"] if row["pred0"] == 8 and row["score0"] - row["score1"] < threshold else row["pred0"]
df_pred["pred*"] = df_pred.apply(lambda x: select_misc(x, 0.5), axis=1)
_ = evaluate_classifier(np.array(df_pred["labels"]), np.array(df_pred["pred*"]), None, labels, "Zero-shot classification, refined", "cm_peril_zero_b")
Zero-shot classification, refined
accuracy score = 69.7%, log loss = nan, Brier loss = nan
classification report
precision recall f1-score support
Vandalism 0.77 0.62 0.69 310
Fire 0.69 0.78 0.73 46
Lightning 0.92 0.94 0.93 123
Wind 0.91 0.85 0.88 107
Hail 0.58 1.00 0.73 18
Vehicle 0.62 0.77 0.69 227
WaterNW 0.54 0.78 0.63 67
WaterW 0.29 0.11 0.15 38
Misc 0.45 0.40 0.42 103
accuracy 0.70 1039
macro avg 0.64 0.69 0.65 1039
weighted avg 0.70 0.70 0.69 1039
We export the output to Excel to analyze the prediction errors.
if not os.path.exists("./results"):
os.mkdir("./results")
df_pred.to_excel("results/peril_pred_zero_shot.xlsx")
Looking at the false predictions on the test set, we observe the following:
True label “Vandalism”, predicted label “Vehicle” or “Misc”: Many descriptions contain the word “glass”. For these claims, “Vandalism” appears to be a natural classification.
True label “Vehicle”, predicted label “Vandalism”: This group contains many descriptions like “light pole damaged”, “fence damaged”. Apparently, the zero-shot classifier does not realize that for these items, damage caused by a vehicle is more likely than damage caused by vandalism.
Based on these and similar observations, one could refine the approach by adding more candidate expressions, e.g., adding “glass” to hazard type 0 (“Vandalism”), “light pole” and “fence” to hazard type 5 (“Vehicle”), “storm” and “ice” to hazard type 7 (“WaterW”), etc.
However, the computational effort of zero-shot classification scales with the number of candidate expressions, so we do not want to supply too many. Ideally, we would have an approach that extracts candidate expressions from the data itself...
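As a toy illustration of that idea (this is our own sketch, not part of the pipeline above; the mini-corpus and stop-word list are invented), one could mine the most frequent corpus terms as candidate expressions:

```python
# Sketch: mine frequent terms from a small invented corpus as
# candidate expressions for zero-shot classification
from collections import Counter

docs = [
    "broken window glass at school",
    "lightning strike damaged the plant",
    "vehicle hit light pole",
    "window smashed by vandalism",
]
stop_words = {"at", "the", "by", "of", "to", "on"}  # minimal hand-picked list
counter = Counter(
    token for doc in docs for token in doc.lower().split() if token not in stop_words
)
candidates = [term for term, _ in counter.most_common(5)]
print(candidates)  # "window" (2 occurrences) ranks first
```

In practice one would add n-grams, lemmatization and a proper stop-word list; the clustering approach of the next section is a more principled answer to the same problem.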
In the previous section we have seen the strength of zero-shot classification: No prior training of the language model is required to produce a classification of reasonable quality. However, it may be difficult to provide suitable candidate expressions.
In this section, we present an alternative approach.
The idea is to encode all text samples, to create clusters of "similar" documents and to extract meaningful verbal representations of the clusters.
Several packages are available to perform this task, e.g., BERTopic, Top2Vec and chat-intents. These packages use similar concepts but provide different APIs, hyper-parameters, diagnostics tools, etc.
Here, we use BERTopic.
The algorithm consists of the following steps:
Embed documents:
Create document embeddings with a pre-trained Transformer model. By default, BERTopic uses the sentence-transformers model all-MiniLM-L6-v2, which is trained on English text. In the multi-lingual case, it uses paraphrase-multilingual-MiniLM-L12-v2.
Cluster documents:
Reduce the dimensionality of the embeddings. This is required because the document embeddings are high-dimensional, and clustering algorithms typically have difficulty clustering data in high-dimensional space. By default, BERTopic uses UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction), as it preserves both the local and the global structure of the embeddings quite well.
Create clusters of semantically similar documents. By default, BERTopic uses HDBSCAN, as it can identify outliers.
Create topic representation:
Extract and reduce topics with c-TF-IDF. This is a modification of TF-IDF which applies TF-IDF to the concatenation of all documents within each cluster, to obtain importance scores for the words within the cluster.
Improve the coherence and diversity of the words with Maximal Marginal Relevance, to find the most coherent words without too much overlap between the words themselves. This results in the removal of words that do not contribute to a topic.
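To make the c-TF-IDF step concrete, here is a minimal stdlib sketch of the idea, run on an invented two-cluster toy corpus (an illustration of the concept, not BERTopic's exact implementation):

```python
# c-TF-IDF sketch: concatenate all documents of a cluster into one "class
# document", then weight each term's within-class frequency by an inverse
# class frequency factor log(1 + A / f_t), where A is the average number
# of words per class and f_t is the term's total frequency across classes.
import math
from collections import Counter

clusters = {  # invented toy corpus
    "glass": ["broken window glass", "glass smashed vandalism", "glass door broken"],
    "lightning": ["lightning strike damage", "lightning hit power plant"],
}

class_counts = {c: Counter(" ".join(docs).split()) for c, docs in clusters.items()}
total_freq = Counter()
for counts in class_counts.values():
    total_freq.update(counts)
avg_words = sum(total_freq.values()) / len(clusters)  # A

def top_word(cls):
    counts = class_counts[cls]
    n = sum(counts.values())
    scores = {t: (f / n) * math.log(1 + avg_words / total_freq[t])
              for t, f in counts.items()}
    return max(scores, key=scores.get)

top_glass, top_lightning = top_word("glass"), top_word("lightning")
print(top_glass, top_lightning)  # -> glass lightning
```

In BERTopic, the same weighting is applied to the full vocabulary of each cluster, and Maximal Marginal Relevance then prunes redundant words from the resulting representation.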
Let's apply the algorithm to our dataset and examine the results.
Normally, BERTopic instantiates UMAP and HDBSCAN automatically. Here, we instantiate them manually and pass them to BERTopic, for the following reasons:
For UMAP, we specify random_state=42, to improve reproducibility across runs. Please note that reproducibility across platforms is not guaranteed.
For HDBSCAN, we specify min_cluster_size=30 and min_samples=1 in order to control the number of clusters and the percentage of samples classified as outliers.
Otherwise, we use the default parameters used by BERTopic.
import random
random.seed(42)
np.random.seed(42)
umap_model = UMAP(n_neighbors=15, n_components=10, metric='cosine', low_memory=False, random_state=42)
hdbscan_model = HDBSCAN(min_cluster_size=30, metric='euclidean', prediction_data=True, min_samples=1)
topic_model = BERTopic(umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(df_train["Description"])
topic_model.get_topic_info()
loading configuration file /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/config.json
Model config BertConfig {
"_name_or_path": "/home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/",
"architectures": [
"BertModel"
],
"attention_probs_dropout_prob": 0.1,
"classifier_dropout": null,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 384,
"initializer_range": 0.02,
"intermediate_size": 1536,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 12,
"num_hidden_layers": 6,
"pad_token_id": 0,
"position_embedding_type": "absolute",
"transformers_version": "4.19.2",
"type_vocab_size": 2,
"use_cache": true,
"vocab_size": 30522
}
loading weights file /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/pytorch_model.bin
All model checkpoint weights were used when initializing BertModel.
All the weights of BertModel were initialized from the model checkpoint at /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/.
If your task is similar to the task the model of the checkpoint was trained on, you can already use BertModel for predictions without further training.
Didn't find file /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/added_tokens.json. We won't load it.
loading file /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/vocab.txt
loading file /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/tokenizer.json
loading file None
loading file /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/special_tokens_map.json
loading file /home/ubuntu/.cache/torch/sentence_transformers/sentence-transformers_all-MiniLM-L6-v2/tokenizer_config.json
| | Topic | Count | Name |
|---|---|---|---|
| 0 | -1 | 1049 | -1_vandalism_at_to_courthouse |
| 1 | 0 | 261 | 0_glass_vandalism_falk_es |
| 2 | 1 | 180 | 1_lightning_plant_dept_hall |
| 3 | 2 | 160 | 2_theft_of_stolen_break |
| 4 | 3 | 141 | 3_graffiti_on_kennedy_hoyt |
| 5 | 4 | 138 | 4_fire_smoke_mower_tic |
| 6 | 5 | 128 | 5_lightning_damage_scale_dpw |
| 7 | 6 | 112 | 6_park_vandalism_pavilion_dmg |
| 8 | 7 | 107 | 7_broken_door_glass_breakage |
| 9 | 8 | 104 | 8_signal_traffic_damaged_paradise |
| 10 | 9 | 103 | 9_surge_power_components_fuses |
| 11 | 10 | 87 | 10_roof_wind_shingles_blew |
| 12 | 11 | 83 | 11_hydrant_fire_hit_run |
| 13 | 12 | 80 | 12_llm_glass_mendota_hawk |
| 15 | 13 | 79 | 13_wind_course_golf_damage |
| 14 | 14 | 79 | 14_froze_pipe_pipes_frozen |
| 17 | 17 | 77 | 17_computer_lightning_to_equipment |
| 18 | 15 | 77 | 15_laptop_washington_ms_theft |
| 16 | 16 | 77 | 16_garage_door_hwy_shop |
| 19 | 18 | 74 | 18_water_damage_goodman_pool |
| 20 | 19 | 73 | 19_fence_gate_vehicle_damaged |
| 21 | 20 | 72 | 20_hail_buildings_roof_multiple |
| 22 | 21 | 69 | 21_building_truck_vehicle_by |
| 23 | 22 | 68 | 22_shelter_eastman_farlin_seymour |
| 24 | 23 | 66 | 23_pole_vehicle_hit_light |
| 25 | 24 | 64 | 24_pole_light_damaged_lightpole |
| 26 | 25 | 62 | 25_window_broken_windows_screens |
| 27 | 26 | 61 | 26_laptop_theft_from_of |
| 28 | 27 | 58 | 27_storm_multiple_sites_locations |
| 30 | 28 | 56 | 28_dmg_humboldt_lafollette_vandalism |
| 29 | 29 | 56 | 29_water_es_ms_west |
| 31 | 30 | 54 | 30_vandalism_damage_bandshell_odonnell |
| 32 | 31 | 51 | 31_airport_lightning_lights_runway |
| 33 | 32 | 51 | 32_center_water_dept_bldg |
| 34 | 33 | 50 | 33_vandalism_lemonweir_lock_gazebo |
| 35 | 34 | 48 | 34_phone_system_phones_telephone |
| 36 | 35 | 48 | 35_street_light_damaged_lights |
| 37 | 36 | 47 | 36_school_water_elementary_high |
| 38 | 37 | 46 | 37_well_meter_flow_lightning |
| 39 | 38 | 45 | 38_sign_vehicle_signal_traffic |
| 40 | 39 | 44 | 39_wind_fence_park_trees |
| 41 | 40 | 44 | 40_overhead_door_damaged_loader |
| 42 | 41 | 44 | 41_equipment_playground_slide_gps |
| 43 | 42 | 42 | 42_hs_lightning_hhs_damage |
| 44 | 43 | 41 | 43_park_washington_vandalism_jacobus |
| 45 | 44 | 38 | 44_hs_water_tremper_pw |
| 46 | 45 | 38 | 45_radio_antenna_lightning_radios |
| 47 | 46 | 38 | 46_water_equipment_carpet_computers |
| 48 | 47 | 34 | 47_street_pole_light_streetlight |
| 49 | 48 | 33 | 48_compressor_unit_air_hvac |
| 50 | 49 | 33 | 49_gym_floor_injured_k9 |
| 51 | 50 | 33 | 50_hydrant_vehicle_struck_hit |
| 52 | 51 | 33 | 51_fire_park_smoke_arson |
| 53 | 52 | 32 | 52_power_outage_well_surge |
| 54 | 53 | 32 | 53_ice_dam_museum_dock |
| 55 | 54 | 31 | 54_radio_lost_dropped_portable |
| 56 | 55 | 30 | 55_tower_lightning_north_internet |
| 57 | 56 | 30 | 56_buildings_building_water_basement |
The first output of fit_transform holds the topic ID for each sample. The second output is the probability of the sample belonging to that topic.
In our case, we have obtained ca. 50 clusters. Due to the randomness of UMAP, the results may differ across platforms and package versions, even with a fixed random seed; unfortunately, we have not found a way to fix this.
The cluster with ID -1 contains all samples which are considered "noise" because they were not attributed to any cluster.
The function get_topic_info returns the topic ID, the sample count, and a concatenation of the words representing the cluster.
To get a visual impression of the clusters, BERTopic provides the function visualize_topics which embeds the c-TF-IDF representation of the topics in 2D using UMAP and then visualizes the two dimensions using plotly in an interactive view.
topic_model.visualize_topics()
We can visualize the selected terms for a few topics by creating bar charts from the c-TF-IDF scores of each topic representation.
Insights can be gained from the relative c-TF-IDF scores between and within topics. Moreover, you can easily compare topic representations to each other. To create the bar charts, simply call the function visualize_barchart:
topic_model.visualize_barchart(top_n_topics=4)
BERTopic creates topics in a hierarchical structure. The function visualize_hierarchy displays this hierarchy. This information is useful for reducing the number of topics, either by specifying a value for the parameter nr_topics upon instantiation of BERTopic, or after training by calling the function reduce_topics.
topic_model.visualize_hierarchy()
Next, we want to assign labels to each cluster. Compared to manually labeling thousands of samples, this task is much less burdensome!
This is usually a manual task. Assignment of labels is guided by the topic information, the topic word scores and the hierarchical clustering.
In our case, the actual labels are available, so that we can use this information to perform the labeling.
Let's inspect how well the clusters match the labels. The graph below shows one column per topic. The shading indicates the distribution of labels within a given topic. A single dark patch in a column indicates that almost all of the samples of the topic are associated with a single label.
df_train["Topic"] = topics
tb = pd.pivot_table(df_train, index=["Topic"], columns=["labels"], aggfunc='count', fill_value=0)["Description"]
fig = px.imshow(tb.divide(tb.sum(axis=1), axis=0).T, zmin=-0.05)
fig.update_layout(xaxis={"dtick": 1}, yaxis={"dtick": 1, "range":[0,8]}, coloraxis={"colorscale": "Greys"})
fig.show()
Obviously, topic -1, which represents the outliers, contains samples from many different classes.
Further, the classes 6 (WaterNW) and 7 (WaterW) seem to be difficult to tell apart based on the clusters; this affects several topics.
For most other topics, the clustering aligns quite well with the labels.
Overall, it appears reasonable to map each topic to the label with the highest frequency. Apart from the exceptions mentioned above, this aligns with the mapping a human would define manually in the absence of the actual labels.
Therefore, let's define the mapping from topics to labels by picking the label with the highest frequency. The table below shows the topic info, enriched with the label counts and the mapping.
tb["mapping"] = tb.values.argmax(axis=1)
tb["label"] = [labels[i] for i in tb["mapping"]]
mapping = {i: tb.loc[i, "mapping"] for i in tb.index}
topic_model.get_topic_info().merge(tb, on="Topic")
| | Topic | Count | Name | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | mapping | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -1 | 1049 | -1_vandalism_at_to_courthouse | 452 | 20 | 156 | 70 | 1 | 113 | 48 | 107 | 82 | 0 | Vandalism |
| 1 | 0 | 261 | 0_glass_vandalism_falk_es | 261 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Vandalism |
| 2 | 1 | 180 | 1_lightning_plant_dept_hall | 0 | 1 | 178 | 0 | 0 | 1 | 0 | 0 | 0 | 2 | Lightning |
| 3 | 2 | 160 | 2_theft_of_stolen_break | 127 | 1 | 0 | 0 | 0 | 6 | 1 | 0 | 25 | 0 | Vandalism |
| 4 | 3 | 141 | 3_graffiti_on_kennedy_hoyt | 140 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Vandalism |
| 5 | 4 | 138 | 4_fire_smoke_mower_tic | 11 | 109 | 3 | 0 | 0 | 8 | 2 | 0 | 5 | 1 | Fire |
| 6 | 5 | 128 | 5_lightning_damage_scale_dpw | 0 | 0 | 127 | 0 | 1 | 0 | 0 | 0 | 0 | 2 | Lightning |
| 7 | 6 | 112 | 6_park_vandalism_pavilion_dmg | 111 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | Vandalism |
| 8 | 7 | 107 | 7_broken_door_glass_breakage | 96 | 0 | 0 | 1 | 0 | 3 | 0 | 0 | 7 | 0 | Vandalism |
| 9 | 8 | 104 | 8_signal_traffic_damaged_paradise | 0 | 0 | 1 | 1 | 0 | 101 | 0 | 0 | 1 | 5 | Vehicle |
| 10 | 9 | 103 | 9_surge_power_components_fuses | 0 | 1 | 7 | 0 | 0 | 0 | 1 | 0 | 94 | 8 | Misc |
| 11 | 10 | 87 | 10_roof_wind_shingles_blew | 2 | 1 | 0 | 59 | 1 | 0 | 1 | 10 | 13 | 3 | Wind |
| 12 | 11 | 83 | 11_hydrant_fire_hit_run | 0 | 0 | 0 | 0 | 0 | 81 | 1 | 0 | 1 | 5 | Vehicle |
| 13 | 12 | 80 | 12_llm_glass_mendota_hawk | 72 | 0 | 2 | 0 | 0 | 2 | 0 | 1 | 3 | 0 | Vandalism |
| 14 | 13 | 79 | 13_wind_course_golf_damage | 0 | 0 | 0 | 78 | 0 | 0 | 0 | 0 | 1 | 3 | Wind |
| 15 | 14 | 79 | 14_froze_pipe_pipes_frozen | 2 | 0 | 1 | 0 | 0 | 8 | 17 | 38 | 13 | 7 | WaterW |
| 16 | 17 | 77 | 17_computer_lightning_to_equipment | 1 | 2 | 72 | 1 | 0 | 0 | 0 | 0 | 1 | 2 | Lightning |
| 17 | 15 | 77 | 15_laptop_washington_ms_theft | 65 | 0 | 0 | 0 | 0 | 5 | 0 | 0 | 7 | 0 | Vandalism |
| 18 | 16 | 77 | 16_garage_door_hwy_shop | 1 | 0 | 0 | 1 | 0 | 72 | 0 | 0 | 3 | 5 | Vehicle |
| 19 | 18 | 74 | 18_water_damage_goodman_pool | 0 | 0 | 0 | 0 | 0 | 0 | 26 | 47 | 1 | 7 | WaterW |
| 20 | 19 | 73 | 19_fence_gate_vehicle_damaged | 6 | 0 | 0 | 6 | 0 | 60 | 0 | 0 | 1 | 5 | Vehicle |
| 21 | 20 | 72 | 20_hail_buildings_roof_multiple | 0 | 0 | 0 | 4 | 68 | 0 | 0 | 0 | 0 | 4 | Hail |
| 22 | 21 | 69 | 21_building_truck_vehicle_by | 1 | 0 | 0 | 4 | 0 | 56 | 0 | 2 | 6 | 5 | Vehicle |
| 23 | 22 | 68 | 22_shelter_eastman_farlin_seymour | 68 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Vandalism |
| 24 | 23 | 66 | 23_pole_vehicle_hit_light | 0 | 0 | 0 | 0 | 1 | 63 | 0 | 0 | 2 | 5 | Vehicle |
| 25 | 24 | 64 | 24_pole_light_damaged_lightpole | 4 | 0 | 0 | 0 | 1 | 57 | 0 | 0 | 2 | 5 | Vehicle |
| 26 | 25 | 62 | 25_window_broken_windows_screens | 55 | 1 | 0 | 1 | 0 | 1 | 0 | 2 | 2 | 0 | Vandalism |
| 27 | 26 | 61 | 26_laptop_theft_from_of | 53 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 7 | 0 | Vandalism |
| 28 | 27 | 58 | 27_storm_multiple_sites_locations | 0 | 0 | 27 | 16 | 2 | 0 | 0 | 13 | 0 | 2 | Lightning |
| 29 | 28 | 56 | 28_dmg_humboldt_lafollette_vandalism | 56 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Vandalism |
| 30 | 29 | 56 | 29_water_es_ms_west | 2 | 0 | 0 | 0 | 0 | 0 | 24 | 30 | 0 | 7 | WaterW |
| 31 | 30 | 54 | 30_vandalism_damage_bandshell_odonnell | 53 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | Vandalism |
| 32 | 31 | 51 | 31_airport_lightning_lights_runway | 0 | 1 | 47 | 1 | 0 | 2 | 0 | 0 | 0 | 2 | Lightning |
| 33 | 32 | 51 | 32_center_water_dept_bldg | 0 | 0 | 0 | 0 | 0 | 2 | 18 | 30 | 1 | 7 | WaterW |
| 34 | 33 | 50 | 33_vandalism_lemonweir_lock_gazebo | 49 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Vandalism |
| 35 | 34 | 48 | 34_phone_system_phones_telephone | 3 | 0 | 36 | 0 | 0 | 0 | 1 | 3 | 5 | 2 | Lightning |
| 36 | 35 | 48 | 35_street_light_damaged_lights | 5 | 0 | 0 | 2 | 1 | 39 | 0 | 1 | 0 | 5 | Vehicle |
| 37 | 36 | 47 | 36_school_water_elementary_high | 1 | 0 | 1 | 0 | 0 | 0 | 12 | 32 | 1 | 7 | WaterW |
| 38 | 37 | 46 | 37_well_meter_flow_lightning | 0 | 1 | 40 | 0 | 0 | 3 | 0 | 0 | 2 | 2 | Lightning |
| 39 | 38 | 45 | 38_sign_vehicle_signal_traffic | 4 | 0 | 0 | 0 | 0 | 41 | 0 | 0 | 0 | 5 | Vehicle |
| 40 | 39 | 44 | 39_wind_fence_park_trees | 0 | 0 | 0 | 42 | 0 | 1 | 0 | 1 | 0 | 3 | Wind |
| 41 | 40 | 44 | 40_overhead_door_damaged_loader | 2 | 0 | 0 | 1 | 0 | 39 | 0 | 0 | 2 | 5 | Vehicle |
| 42 | 41 | 44 | 41_equipment_playground_slide_gps | 12 | 0 | 0 | 1 | 0 | 12 | 0 | 0 | 19 | 8 | Misc |
| 43 | 42 | 42 | 42_hs_lightning_hhs_damage | 0 | 0 | 42 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | Lightning |
| 44 | 43 | 41 | 43_park_washington_vandalism_jacobus | 41 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Vandalism |
| 45 | 44 | 38 | 44_hs_water_tremper_pw | 0 | 1 | 0 | 0 | 0 | 0 | 10 | 26 | 1 | 7 | WaterW |
| 46 | 45 | 38 | 45_radio_antenna_lightning_radios | 0 | 1 | 37 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | Lightning |
| 47 | 46 | 38 | 46_water_equipment_carpet_computers | 0 | 0 | 1 | 0 | 0 | 0 | 13 | 23 | 1 | 7 | WaterW |
| 48 | 47 | 34 | 47_street_pole_light_streetlight | 0 | 0 | 0 | 0 | 0 | 33 | 0 | 0 | 1 | 5 | Vehicle |
| 49 | 48 | 33 | 48_compressor_unit_air_hvac | 0 | 1 | 21 | 0 | 0 | 2 | 0 | 2 | 7 | 2 | Lightning |
| 50 | 49 | 33 | 49_gym_floor_injured_k9 | 4 | 0 | 0 | 0 | 0 | 0 | 7 | 14 | 8 | 7 | WaterW |
| 51 | 50 | 33 | 50_hydrant_vehicle_struck_hit | 2 | 0 | 0 | 0 | 0 | 29 | 1 | 0 | 1 | 5 | Vehicle |
| 52 | 51 | 33 | 51_fire_park_smoke_arson | 5 | 27 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | Fire |
| 53 | 52 | 32 | 52_power_outage_well_surge | 1 | 1 | 3 | 2 | 0 | 1 | 1 | 2 | 21 | 8 | Misc |
| 54 | 53 | 32 | 53_ice_dam_museum_dock | 1 | 0 | 0 | 2 | 0 | 0 | 2 | 24 | 3 | 7 | WaterW |
| 55 | 54 | 31 | 54_radio_lost_dropped_portable | 4 | 1 | 0 | 1 | 0 | 8 | 5 | 1 | 11 | 8 | Misc |
| 56 | 55 | 30 | 55_tower_lightning_north_internet | 0 | 0 | 30 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | Lightning |
| 57 | 56 | 30 | 56_buildings_building_water_basement | 1 | 0 | 0 | 1 | 0 | 1 | 11 | 16 | 0 | 7 | WaterW |
Now, let's apply this model to the validation set. First, we assign each sample to a cluster, based on the clustering model.
topics_test, probs_test = topic_model.transform(df_valid["Description"])
/home/ubuntu/anaconda3/envs/pytorch_latest_p37/lib/python3.7/site-packages/scipy/sparse/_index.py:125: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
Then, we apply the mapping from topics to labels, which we have defined above based on the training set. The table below shows for each topic the frequency by label, and the mapping.
df_valid["Topic"] = topics_test
df_valid["prob"] = probs_test
df_valid["pred"] = [mapping[t] for t in topics_test]
df_valid.to_excel("results/peril_topics.xlsx")
tb_valid = pd.pivot_table(df_valid, index=["Topic"], columns=["labels"], aggfunc='count', fill_value=0)["Description"]
tb_valid["mapping"] = [mapping[t] for t in tb_valid.index]
tb_valid
| Topic | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | mapping |
|---|---|---|---|---|---|---|---|---|---|---|
| -1 | 57 | 11 | 22 | 22 | 0 | 28 | 22 | 13 | 31 | 0 |
| 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 12 | 1 | 0 | 0 | 1 | 0 | 0 | 2 |
| 2 | 17 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 9 | 0 |
| 3 | 59 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 0 |
| 4 | 2 | 25 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 5 | 0 | 0 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 6 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | 18 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 8 | 1 | 0 | 0 | 2 | 0 | 10 | 0 | 0 | 1 | 5 |
| 9 | 0 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 8 |
| 10 | 0 | 0 | 0 | 27 | 0 | 0 | 0 | 1 | 2 | 3 |
| 11 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | 5 |
| 12 | 104 | 1 | 0 | 1 | 0 | 5 | 1 | 1 | 1 | 0 |
| 13 | 0 | 0 | 0 | 35 | 0 | 0 | 0 | 1 | 1 | 3 |
| 14 | 2 | 1 | 0 | 0 | 0 | 4 | 11 | 3 | 4 | 7 |
| 15 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| 16 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 0 | 5 |
| 17 | 0 | 1 | 26 | 0 | 0 | 0 | 0 | 0 | 2 | 2 |
| 18 | 0 | 0 | 0 | 0 | 0 | 0 | 7 | 3 | 0 | 7 |
| 19 | 1 | 0 | 0 | 0 | 0 | 9 | 0 | 0 | 1 | 5 |
| 20 | 0 | 0 | 0 | 2 | 18 | 0 | 0 | 0 | 0 | 4 |
| 21 | 1 | 0 | 0 | 0 | 0 | 15 | 0 | 0 | 3 | 5 |
| 22 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 23 | 0 | 0 | 0 | 1 | 0 | 47 | 0 | 0 | 0 | 5 |
| 24 | 0 | 0 | 1 | 1 | 0 | 6 | 0 | 0 | 1 | 5 |
| 25 | 16 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
| 26 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 5 | 0 |
| 27 | 0 | 0 | 8 | 5 | 0 | 0 | 0 | 3 | 0 | 2 |
| 29 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 7 |
| 30 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 31 | 0 | 0 | 11 | 1 | 0 | 0 | 0 | 0 | 0 | 2 |
| 32 | 0 | 0 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 7 |
| 33 | 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 34 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 35 | 1 | 0 | 0 | 0 | 0 | 19 | 0 | 0 | 0 | 5 |
| 36 | 0 | 2 | 0 | 0 | 0 | 0 | 3 | 1 | 0 | 7 |
| 37 | 0 | 1 | 3 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 38 | 0 | 0 | 0 | 0 | 0 | 22 | 0 | 0 | 2 | 5 |
| 39 | 0 | 0 | 0 | 7 | 0 | 1 | 0 | 0 | 0 | 3 |
| 40 | 3 | 0 | 1 | 2 | 0 | 15 | 0 | 0 | 1 | 5 |
| 41 | 1 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 1 | 8 |
| 44 | 0 | 0 | 0 | 0 | 0 | 0 | 2 | 0 | 0 | 7 |
| 45 | 0 | 0 | 12 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 46 | 0 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 7 |
| 47 | 0 | 0 | 0 | 0 | 0 | 3 | 0 | 0 | 1 | 5 |
| 48 | 0 | 0 | 4 | 0 | 0 | 1 | 0 | 0 | 2 | 2 |
| 49 | 0 | 0 | 0 | 0 | 0 | 1 | 6 | 3 | 2 | 7 |
| 50 | 0 | 0 | 0 | 0 | 0 | 7 | 0 | 0 | 0 | 5 |
| 51 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 52 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 6 | 8 |
| 53 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 5 | 1 | 7 |
| 54 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 3 | 8 |
| 55 | 0 | 1 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| 56 | 0 | 1 | 0 | 0 | 0 | 0 | 4 | 3 | 2 | 7 |
This classifier achieves an accuracy score of ca. 70%, compared to 30% obtained with the dummy classifier.
_ = evaluate_classifier(df_valid["labels"], df_valid["pred"], None, labels, "Topic modeling by clustering", "cm_peril_topic_a")
Topic modeling by clustering
accuracy score = 69.9%, log loss = nan, Brier loss = nan
classification report
precision recall f1-score support
Vandalism 0.62 0.95 0.75 310
Fire 0.87 0.59 0.70 46
Lightning 0.83 0.77 0.80 123
Wind 0.92 0.64 0.76 107
Hail 0.90 1.00 0.95 18
Vehicle 0.88 0.79 0.83 227
WaterNW 0.00 0.00 0.00 67
WaterW 0.23 0.50 0.31 38
Misc 0.77 0.22 0.35 103
accuracy 0.70 1039
macro avg 0.67 0.61 0.60 1039
weighted avg 0.71 0.70 0.67 1039
BERTopic provides the function find_topics which returns a list of IDs and similarity scores of topics that best match a given search term.
This is useful to validate the mapping. Let's use the search term "Fire" and retrieve the three most similar topics. For each of these topics, we print the similarity score and the label it was mapped to. We also show the word scores for each topic.
similar_topics, similarity = topic_model.find_topics("Fire", top_n=3)
for t, s in zip(similar_topics, similarity):
print(f"topic {t:2d}: similarity score {s:.1%}, mapped to peril {mapping[t]:d} ({labels[mapping[t]]})")
topic_model.visualize_barchart(similar_topics)
topic  4: similarity score 90.7%, mapped to peril 1 (Fire)
topic 51: similarity score 85.5%, mapped to peril 1 (Fire)
topic 11: similarity score 68.9%, mapped to peril 5 (Vehicle)
As expected, the topics which have been mapped to "Fire" appear first in the list, with similarity scores of more than 80%.
The first topic that was not mapped to "Fire" has a similarity score of less than 70%. It was mapped to the label "Vehicle". Indeed: although the word "fire" ranks second in the word scores, it appears in combination with "hydrant": this topic is about vehicles hitting fire hydrants.
Above, a relatively large number of samples was classified as outliers. All outliers were mapped to a single class, but this mapping is questionable, because we have seen that the outlier samples belong to many different classes.
To mitigate this issue, we could label the outlier samples manually. However, this is quite tedious.
Alternatively, we can train a classifier on the labels obtained from the unsupervised approach. To avoid label noise, we exclude the outliers from the training data.
First, we create the training dataset. We replace the true labels by the labels obtained from the clustering approach.
df_train_unsupervised = df_train[df_train["Topic"]>=0].copy()
df_train_unsupervised["labels"] = [mapping[t] for t in df_train_unsupervised["Topic"]]
ds_train_unsupervised = Dataset.from_pandas(df_train_unsupervised)
ds_train_unsupervised = ds_train_unsupervised.map(tokenize, batched=True)
model_name = "distilbert-base-uncased"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
torch.manual_seed(42) # for reproducibility, set random seed before instantiating the model
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(labels)).to(device)
# train the model
batch_size = 8
logging_steps = len(ds_train_unsupervised) // batch_size
training_args = TrainingArguments(
output_dir=model_name+"_peril_u_epochs",
num_train_epochs=2,
per_device_train_batch_size=batch_size,
per_device_eval_batch_size=batch_size,
metric_for_best_model="f1",
logging_steps=logging_steps,
save_strategy=trainer_utils.IntervalStrategy.NO,
)
trainer = Trainer(model=model, args=training_args,
compute_metrics=compute_metrics, train_dataset=ds_train_unsupervised,
eval_dataset=ds["test"])
trainer.train();
trainer.save_model(model_name + "_peril_u")
loading configuration file https://huggingface.co/distilbert-base-uncased/resolve/main/config.json from cache at /home/ubuntu/.cache/huggingface/transformers/23454919702d26495337f3da04d1655c7ee010d5ec9d77bdb9e399e00302c0a1.91b885ab15d631bf9cee9dc9d25ece0afd932f2f5130eba28f2055b2220c0333
Model config DistilBertConfig {
"_name_or_path": "distilbert-base-uncased",
"activation": "gelu",
"architectures": [
"DistilBertForMaskedLM"
],
"attention_dropout": 0.1,
"dim": 768,
"dropout": 0.1,
"hidden_dim": 3072,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1",
"2": "LABEL_2",
"3": "LABEL_3",
"4": "LABEL_4",
"5": "LABEL_5",
"6": "LABEL_6",
"7": "LABEL_7",
"8": "LABEL_8"
},
"initializer_range": 0.02,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1,
"LABEL_2": 2,
"LABEL_3": 3,
"LABEL_4": 4,
"LABEL_5": 5,
"LABEL_6": 6,
"LABEL_7": 7,
"LABEL_8": 8
},
"max_position_embeddings": 512,
"model_type": "distilbert",
"n_heads": 12,
"n_layers": 6,
"pad_token_id": 0,
"qa_dropout": 0.1,
"seq_classif_dropout": 0.2,
"sinusoidal_pos_embds": false,
"tie_weights_": true,
"transformers_version": "4.19.2",
"vocab_size": 30522
}
loading weights file https://huggingface.co/distilbert-base-uncased/resolve/main/pytorch_model.bin from cache at /home/ubuntu/.cache/huggingface/transformers/9c169103d7e5a73936dd2b627e42851bec0831212b677c637033ee4bce9ab5ee.126183e36667471617ae2f0835fab707baa54b731f991507ebbb55ea85adb12a
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_transform.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.weight', 'classifier.bias', 'pre_classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
PyTorch: setting up devices
The default value for the training argument `--report_to` will change in v5 (from all installed integrations to none). In v5, you will need to use `--report_to all` to get the same behavior as now. You should start updating your code and make this info disappear :-).
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: WaterW, Hail, Topic, Misc, Vandalism, Fire, Lightning, Description, Wind, words per description, Loss, __index_level_0__, Vehicle, WaterNW. If WaterW, Hail, Topic, Misc, Vandalism, Fire, Lightning, Description, Wind, words per description, Loss, __index_level_0__, Vehicle, WaterNW are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
***** Running training *****
Num examples = 3942
Num Epochs = 2
Instantaneous batch size per device = 8
Total train batch size (w. parallel, distributed & accumulation) = 8
Gradient Accumulation steps = 1
Total optimization steps = 986
| Step | Training Loss |
|---|---|
| 492 | 0.365700 |
| 984 | 0.068800 |
Training completed. Do not forget to share your model on huggingface.co/models =)
Saving model checkpoint to distilbert-base-uncased_peril_u
Configuration saved in distilbert-base-uncased_peril_u/config.json
Model weights saved in distilbert-base-uncased_peril_u/pytorch_model.bin
Then, we evaluate the classifier on the test set, by comparing the predicted to the true labels.
predictions = trainer.predict(ds["test"])
_ = evaluate_classifier(predictions.label_ids, None, softmax(predictions.predictions, axis=1), labels, "Topic modeling by clustering, refined", "cm_peril_topic_b")
The following columns in the test set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: cls_hidden_state, WaterW, Hail, Misc, Vandalism, Fire, Lightning, Description, mean_hidden_state, Wind, Loss, Vehicle, WaterNW. If cls_hidden_state, WaterW, Hail, Misc, Vandalism, Fire, Lightning, Description, mean_hidden_state, Wind, Loss, Vehicle, WaterNW are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
***** Running Prediction *****
Num examples = 1039
Batch size = 8
Topic modeling by clustering, refined
accuracy score = 79.3%, log loss = 1.486, Brier loss = 0.389
classification report
precision recall f1-score support
Vandalism 0.87 0.92 0.89 310
Fire 0.81 0.74 0.77 46
Lightning 0.90 0.95 0.92 123
Wind 0.87 0.84 0.86 107
Hail 0.90 1.00 0.95 18
Vehicle 0.87 0.92 0.89 227
WaterNW 0.00 0.00 0.00 67
WaterW 0.26 0.84 0.40 38
Misc 0.73 0.40 0.52 103
accuracy 0.79 1039
macro avg 0.69 0.73 0.69 1039
weighted avg 0.78 0.79 0.78 1039
The accuracy score has improved significantly, from 69.9% to 79.3%.
Congratulations!
In this Part II of the tutorial, you have first applied the techniques you have learned in Part I to a dataset with shorter texts.
Then you have learned how to use zero-shot classification in a situation with no labels. The beauty of this approach is that it requires no training and produces a reasonable classification based on a list of user-defined candidate expressions.
Going one step further, you have seen an approach that creates clusters of similar documents and represents each cluster by typical words. This can be used as a starting point to create meaningful labels.
If you have enjoyed this tutorial, feel free to apply any of the approaches - or improved versions, of course - to your own text data, to enrich your structured features available for supervised learning tasks.